1 Executive Summary

This report explores three research questions using the “vgsales” dataset. For the first question, we analyzed publisher popularity from several angles: each publisher’s total global sales, its sales in each region, and the number of games it has published. We conclude that Nintendo is the most popular publisher. For the second question, we compared the contributions of the Japanese (JP) and North American (NA) markets to the sales of the four most popular publishers. For the third question, we found that the platform genre is one of the best-selling game genres globally, although the single game with the highest global sales belongs to the sports genre.


2 Full Report

2.1 Initial Data Analysis (IDA)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
vgsales = read.csv("vgsales.csv")

Our data comes from VGChartz, a well-known website that tracks video game sales. For software sales, VGChartz records official sales data, which is generally provided by the publisher or developer. For hardware sales, VGChartz makes weekly estimates based on retail sampling and trends in individual regions, which are then extrapolated to represent the wider region. VGChartz also compares and amends its data against official sales figures and the data of other trackers with wider coverage. We therefore consider this dataset reasonably reliable.

The data collected by VGChartz still has limitations. First, all figures presented by VGChartz are estimates, although they are reported to stay within 10% of the real figures. Second, the software figures rely on data provided by publishers. Some publishers do not provide data to VGChartz, so their sales are not included in our dataset.

This project explores which game publisher is the most popular, which could help game developers who are looking for a company to publish their games. It compares sales in North America, Europe, and Japan, which can help both large companies and smaller developers decide which market to target. Lastly, it compares the best-selling game genres globally, which can help both small developers and large companies decide which genre to pick.

str(vgsales)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr  "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
vgsales$Year = as.numeric(vgsales$Year)  # store the converted Year back in the data frame
## Warning: NAs introduced by coercion

This dataset includes 16,598 observations of 11 variables: Rank, Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, and Global_Sales. The ‘Rank’ variable has the integer class and assigns each observation an id number. The ‘Name’, ‘Platform’, ‘Genre’, and ‘Publisher’ variables have the character class and give the name of the game, the platform the game runs on, the genre of the game, and the name of the publisher, respectively. The ‘Year’ variable initially had the character class but was converted to the numeric class; it represents the year in which the game was released. The ‘NA_Sales’, ‘EU_Sales’, ‘JP_Sales’, ‘Other_Sales’, and ‘Global_Sales’ variables have the numeric class and represent North American sales, European sales, Japanese sales, sales in all other regions, and global sales, respectively.
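The coercion warning above comes from Year entries that are not numeric. The following is a minimal sketch of how one might count such entries when converting, using a small made-up data frame. The `toy` rows and the "N/A" placeholder are assumptions for illustration and have not been checked against the real file.

```r
# Toy stand-in for vgsales; the rows and the "N/A" placeholder are illustrative assumptions
toy <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Year = c("2006", "N/A", "1985"),
  stringsAsFactors = FALSE
)

# Convert Year to numeric; non-numeric strings become NA (the warning is suppressed here)
toy$Year <- suppressWarnings(as.numeric(toy$Year))

# Count how many years could not be parsed
n_missing_year <- sum(is.na(toy$Year))
print(n_missing_year)
```

Running the same count on `vgsales$Year` would show how many games lack a usable release year.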


2.2 Research Question 1

Which game publisher is most popular?

vgsales %>%                        # Using the `vgsales` data, we will
  select(Publisher, Global_Sales) %>%   # select the variables `Publisher` and `Global_Sales`, then
  group_by(Publisher) %>%                # group by `Publisher`, then
  summarise(Global_Sales = sum(Global_Sales)) %>% # summarise each group by the sum `Global_Sales`, then
  arrange(desc(Global_Sales)) %>%                 # sort by total sales, largest first, then
  slice(1:5) %>%                                  # keep the top 5 publishers
  ggplot(aes(x = Publisher, y = Global_Sales)) + # plot the results using `ggplot()`
  geom_bar(stat = "identity") +
  xlab("Publisher") +
  ylab("Global sales (millions)")

We first add up the global sales of all games for each publisher, then sort the totals and keep the top five publishers for a bar plot. The plot shows that Nintendo takes first place by a wide margin, with about 600 million more sales than the second-placed company, Electronic Arts.
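The same aggregation is repeated for each region in this section, so it could be wrapped in a small function. The sketch below is one possible way to do that; `top_publishers` is a hypothetical helper (not part of the original analysis), and the toy numbers are made up.

```r
library(dplyr)

# Hypothetical helper: total a sales column per publisher and keep the n largest
top_publishers <- function(data, sales_col, n = 5) {
  data %>%
    group_by(Publisher) %>%
    summarise(total = sum(.data[[sales_col]]), .groups = "drop") %>%
    arrange(desc(total)) %>%
    slice_head(n = n)
}

# Toy data (made-up numbers, in millions)
toy <- data.frame(
  Publisher = c("A", "A", "B", "C", "C", "C"),
  Global_Sales = c(10, 5, 12, 1, 2, 3),
  stringsAsFactors = FALSE
)

top2 <- top_publishers(toy, "Global_Sales", n = 2)
print(top2)
```

With the real data, a call such as `top_publishers(vgsales, "NA_Sales")` would give the table behind the North American plot.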

vgsales %>%                        # Using the `vgsales` data, we will
  select(Publisher, NA_Sales) %>%   # select the variables `Publisher` and `NA_Sales`, then
  group_by(Publisher) %>%                # group by `Publisher`, then
  summarise(NA_Sales = sum(NA_Sales)) %>% # summarise each group by the sum `NA_Sales`, then
  arrange(desc(NA_Sales)) %>%                 # sort by total sales, largest first, then
  slice(1:5) %>%                              # keep the top 5 publishers
  ggplot(aes(x = Publisher, y = NA_Sales)) + # plot the results using `ggplot()`
  geom_bar(stat = "identity") +
  xlab("Publisher") +
  ylab("NA sales (millions)")

In North America, Nintendo still takes first place, but its lead over Electronic Arts is considerably smaller, at about 300 million in sales.

vgsales %>%                        # Using the `vgsales` data, we will
  select(Publisher, EU_Sales) %>%   # select the variables `Publisher` and `EU_Sales`, then
  group_by(Publisher) %>%                # group by `Publisher`, then
  summarise(EU_Sales = sum(EU_Sales)) %>% # summarise each group by the sum `EU_Sales`, then
  arrange(desc(EU_Sales)) %>%                 # sort by total sales, largest first, then
  slice(1:5) %>%                              # keep the top 5 publishers
  ggplot(aes(x = Publisher, y = EU_Sales)) + # plot the results using `ggplot()`
  geom_bar(stat = "identity") +
  xlab("Publisher") +
  ylab("EU sales (millions)")

In Europe, the sales gap between Nintendo and Electronic Arts narrows further. Compared with North America, the gap narrows not because other publishers sell more, but because Nintendo’s sales fall further than the others’ when all publishers’ sales are lower.

vgsales %>%                        # Using the `vgsales` data, we will
  select(Publisher, JP_Sales) %>%   # select the variables `Publisher` and `JP_Sales`, then
  group_by(Publisher) %>%                # group by `Publisher`, then
  summarise(JP_Sales = sum(JP_Sales)) %>% # summarise each group by the sum `JP_Sales`, then
  arrange(desc(JP_Sales)) %>%                 # sort by total sales, largest first, then
  slice(1:5) %>%                              # keep the top 5 publishers
  ggplot(aes(x = Publisher, y = JP_Sales)) + # plot the results using `ggplot()`
  geom_bar(stat = "identity") +
  xlab("Publisher") +
  ylab("JP sales (millions)")

As a Japanese company, Nintendo has an overwhelming sales advantage in Japan. We think this is why Nintendo has such a large lead in total sales, although it does not diminish the success and popularity of Nintendo’s games. The other companies that rank highly in Japan are also Japanese, but their sales still lag far behind Nintendo’s.

vgsales %>%                        # Using the `vgsales` data, we will
  select(Publisher, Other_Sales) %>%   # select the variables `Publisher` and `Other_Sales`, then
  group_by(Publisher) %>%                # group by `Publisher`, then
  summarise(Other_Sales = sum(Other_Sales)) %>% # summarise each group by the sum `Other_Sales`, then
  arrange(desc(Other_Sales)) %>%                 # sort by total sales, largest first, then
  slice(1:5) %>%                                 # keep the top 5 publishers
  ggplot(aes(x = Publisher, y = Other_Sales)) + # plot the results using `ggplot()`
  geom_bar(stat = "identity") +
  xlab("Publisher") +
  ylab("Other sales (millions)")

This chart shows sales in all regions other than the three above. Here, Electronic Arts overtakes Nintendo for the first time.

vgsales %>%    
  count(Publisher)  %>%  
  arrange(desc(n))%>%
  slice(1:5)   %>%  
  ggplot(aes(x = Publisher, y = n)) + # plot the results using `ggplot()`
  geom_bar(stat = "identity") +
  xlab("Publisher") +
  ylab("Number of games published")

Finally, this chart shows the five publishers with the most published games. Electronic Arts has released the most, with 1351 games. Nintendo ranks seventh with 703 games, so it does not appear in the chart.

Based on the analysis above, we believe that Nintendo is the most popular game publisher. Even though it does not publish as many games as some competitors, it still leads sales in most regions. Nintendo was founded in 1889 and entered the video game business in the 1970s. As a long-established video game company, Nintendo has produced consoles that many people grew up with. It is very good at creating famous game characters, such as Super Mario, and building related products around them.
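The observation that a publisher can lead in sales while releasing fewer titles can be made concrete by dividing total sales by the number of titles. The sketch below illustrates the calculation on made-up data; the publishers "A" and "B" and their figures are assumptions, not values from the dataset.

```r
# Toy data (made-up); Global_Sales in millions
toy <- data.frame(
  Publisher = c("A", "A", "B", "B", "B", "B"),
  Global_Sales = c(8, 4, 1, 2, 3, 2),
  stringsAsFactors = FALSE
)

# Total sales and number of titles per publisher
totals <- tapply(toy$Global_Sales, toy$Publisher, sum)
titles <- tapply(toy$Global_Sales, toy$Publisher, length)

# Average sales per title
sales_per_title <- totals / titles
print(sales_per_title)
```

In this toy example, publisher A has half as many titles as B but a higher total and a higher average per title, which is the pattern described for Nintendo.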

2.3 Research Question 2

How does the JP market’s contribution to total sales compare with the NA market’s for the four most popular publishers?

library(RColorBrewer)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/modules/R_X11.so
##   Reason: image not found
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## library/tcltk/libs//tcltk.so'' had status 1
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite
myPalette <- brewer.pal(5, "Set2") 
sell_lable <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")

The dataset provides JP, NA, EU, and other sales, so we examine how the JP market’s contribution compares with the NA market’s for the four most popular publishers.

Figure 2.1 displays the pie chart for Nintendo, one of the biggest and most famous Japanese publishers in the world. According to worldometers.info, NA has almost three times the population of Japan. However, NA sales, at 46% of Nintendo’s global sales, are only about twice JP sales, which indicates that Japanese consumers contribute more sales per person than North American consumers.
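The per-person comparison can be written out as a short calculation. The sketch below uses only the rounded figures quoted in the paragraph above; the JP share is inferred from the stated 2:1 ratio rather than read from the data, so the result is approximate.

```r
# Figures taken from the paragraph above (rounded)
na_share  <- 0.46          # NA share of Nintendo's global sales
jp_share  <- na_share / 2  # JP share implied by "NA is about twice JP" (an inference)
pop_ratio <- 3             # NA population relative to Japan (approximate)

# Per-person sales in Japan relative to North America:
# (JP share / JP population) divided by (NA share / NA population)
per_capita_ratio <- (jp_share / 1) / (na_share / pop_ratio)
print(per_capita_ratio)
```

Under these assumptions the ratio is about 1.5, meaning roughly one and a half times as many Nintendo sales per person in Japan as in North America.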

Nintendo <- sqldf("select sum(NA_Sales), sum(EU_Sales),sum(JP_Sales),sum(OTher_Sales) from vgsales where Publisher = 'Nintendo'")

Nintendo_Pie <- data.frame(lable = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"), data = c(Nintendo[,1], Nintendo[,2],Nintendo[,3],Nintendo[,4]))

Nintendo_label_value <- paste('(', round(Nintendo_Pie$data/sum(Nintendo_Pie$data) * 100, 1), '%)', sep = '')
label <- paste(Nintendo_Pie$lable, Nintendo_label_value, sep = '')

ggplot(Nintendo_Pie, aes(x="", y=data, fill=lable)) +
  geom_bar(stat="identity", width=1, position = 'stack') +
  coord_polar("y", start=0)+
  labs(x='', y='', title = 'Nintendo Sales Pie Chart')+
  theme(axis.text = element_blank()) + 
  theme(axis.ticks = element_blank()) +
  scale_fill_discrete(breaks = Nintendo_Pie$lable, labels = label)  # set breaks so each label matches its own region

Figure 2.1 Nintendo Sales Pie Chart

Figure 2.2 shows the pie chart of Electronic Arts sales. Electronic Arts, known as EA, is a famous American publisher. The situation here is quite different from Nintendo’s. The NA market contributes 54% of global sales, while the JP market contributes only 2%. Even though the NA population is about three times the JP population, the JP market’s share is far lower than the population ratio would suggest.

Figure 2.2 EA Sales Pie Chart

EA <- sqldf("select sum(NA_Sales), sum(EU_Sales),sum(JP_Sales),sum(OTher_Sales) from vgsales where Publisher = 'Electronic Arts'")

EA_Pie <- data.frame(lable = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"), data = c(EA[,1], EA[,2],EA[,3],EA[,4]))

EA_label_value <- paste('(', round(EA_Pie$data/sum(EA_Pie$data) * 100, 1), '%)', sep = '')
label <- paste(EA_Pie$lable, EA_label_value, sep = '')

ggplot(EA_Pie, aes(x="", y=data, fill=lable)) +
  geom_bar(stat="identity", width=1, position = 'stack') +
  coord_polar("y", start=0)+
  labs(x='', y='', title = 'EA Sales Pie Chart')+
  theme(axis.text = element_blank()) + 
  theme(axis.ticks = element_blank()) +
  scale_fill_discrete(breaks = EA_Pie$lable, labels = label)  # set breaks so each label matches its own region

Figure 2.3 displays the pie chart for Activision. Activision, like EA, is an American publisher based in California. As with EA, the NA contribution is large, at up to 60% of global sales, and the JP market contributes very little. Interestingly, the JP market appears as 0% of global sales even though Japan does contribute some sales: Activision’s JP share is less than 1% of its global sales, which we treat as 0%.

Figure 2.3 Activision Sales Pie Chart

Activision <- sqldf("select sum(NA_Sales), sum(EU_Sales),sum(JP_Sales),sum(OTher_Sales) from vgsales where Publisher = 'Activision'")

Activision_Pie <- data.frame(lable = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"), data = c(Activision[,1], Activision[,2],Activision[,3],Activision[,4]))

Activision_label_value <- paste('(', round(Activision_Pie$data/sum(Activision_Pie$data) * 100, 1), '%)', sep = '')
label <- paste(Activision_Pie$lable, Activision_label_value, sep = '')

ggplot(Activision_Pie, aes(x="", y=data, fill=lable)) +
  geom_bar(stat="identity", width=1, position = 'stack') +
  coord_polar("y", start=0)+
  labs(x='', y='', title = 'Activision Sales Pie Chart')+
  theme(axis.text = element_blank()) + 
  theme(axis.ticks = element_blank()) +
  scale_fill_discrete(breaks = Activision_Pie$lable, labels = label)  # set breaks so each label matches its own region

Figure 2.4 is the pie chart of Sony sales. Sony is one of the most famous Japanese publishers and has released not only games but also the PlayStation game consoles worldwide. For Sony, the NA market contributes 44% of sales and the JP market contributes 12%. Although the JP market contributes a considerable share, it is still lower than expected.

Figure 2.4 Sony Sales Pie Chart

Sony <- sqldf("select sum(NA_Sales), sum(EU_Sales),sum(JP_Sales),sum(OTher_Sales) from vgsales where Publisher = 'Sony Computer Entertainment'")

Sony_Pie <- data.frame(lable = c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"), data = c(Sony[,1], Sony[,2],Sony[,3],Sony[,4]))

Sony_label_value <- paste('(', round(Sony_Pie$data/sum(Sony_Pie$data) * 100, 1), '%)', sep = '')
label <- paste(Sony_Pie$lable, Sony_label_value, sep = '')

ggplot(Sony_Pie, aes(x="", y=data, fill=lable)) +
  geom_bar(stat="identity", width=1, position = 'stack') +
  coord_polar("y", start=0)+
  labs(x='', y='', title = 'Sony Sales Pie Chart')+
  theme(axis.text = element_blank()) + 
  theme(axis.ticks = element_blank()) +
  scale_fill_discrete(breaks = Sony_Pie$lable, labels = label)  # set breaks so each label matches its own region
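The four pie charts in this section all rest on the same calculation: each region's percentage of a publisher's sales. A minimal base R sketch of that calculation follows. `region_shares` is a hypothetical helper (the analysis above uses `sqldf` instead), and the toy numbers are made up.

```r
# Hypothetical helper: percentage of a publisher's sales coming from each region
region_shares <- function(data, publisher) {
  regions <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")
  rows <- data[data$Publisher == publisher, regions]
  totals <- colSums(rows)                  # named vector, one total per region
  round(100 * totals / sum(totals), 1)     # percentages, names preserved
}

# Toy data (made-up numbers, in millions)
toy <- data.frame(
  Publisher = c("A", "A", "B"),
  NA_Sales = c(3, 2, 1),
  EU_Sales = c(2, 1, 1),
  JP_Sales = c(1, 0.5, 1),
  Other_Sales = c(0.3, 0.2, 1),
  stringsAsFactors = FALSE
)

shares_a <- region_shares(toy, "A")
print(shares_a)
```

Because the result is a named vector, each percentage stays attached to its region name, so it does not depend on the order in which labels are passed to a plot.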

2.4 Research Question 3

What is the best globally selling game genre?

This dataset is a list of games divided into several genres, with each game’s global sales given in millions. Two main sources of bias need to be addressed for this research question: the number of games in each genre and the popularity of the company releasing a game. Both can heavily skew the results.

p1 = ggplot(vgsales, aes(x = reorder(Genre, +Global_Sales, FUN = median), y = Global_Sales, fill = Genre)) + geom_boxplot(outlier.shape = NA)
p1 = p1 + ylim(c(0,2))
p1 = p1 + theme_minimal()
p1 = p1 + theme(legend.position = "none")
p1 = p1 + labs(title = "Box Plots of Global Sales for Different Game Genres")
p1 = p1 + labs(x = "Game Genre", y = "Global Sales in Millions")
p1 = p1 + labs(caption = "Figure 1")
p1 = p1 + stat_boxplot(geom = "errorbar", width = 0.15)
p1 = p1 + stat_summary(fun = mean, geom = "point", colour = "Black", size = 3)
p1 = p1 + coord_flip()
p1
## Warning: Removed 846 rows containing non-finite values (stat_boxplot).

## Warning: Removed 846 rows containing non-finite values (stat_boxplot).
## Warning: Removed 846 rows containing non-finite values (stat_summary).

Figure 1 gives a general comparison between game genres, ordered by median global sales. The platform genre has the highest median, upper quartile, and upper whisker. Figure 1 also marks each genre’s mean with a black dot, and platform again has the highest value. However, Figure 1 has limitations. The data varies widely and contains many outliers, which are not shown. In addition, ylim(c(0,2)) drops every game with global sales above 2 million before the statistics are computed (the 846 rows reported in the warnings), so the quartiles and means shown describe only games at or below 2 million.
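The medians and means read off Figure 1 can also be computed directly, which makes it easy to report the number of games per genre alongside them. The sketch below shows one way to do this on made-up data; the two genres and their values are illustrative assumptions, chosen so that one very large title pulls the mean well above the median.

```r
library(dplyr)

# Toy data (made-up); Global_Sales in millions
toy <- data.frame(
  Genre = c("Platform", "Platform", "Platform", "Sports", "Sports", "Sports"),
  Global_Sales = c(0.5, 1.0, 6.0, 0.2, 0.4, 30.0),
  stringsAsFactors = FALSE
)

# Count, median, and mean per genre, computed on all rows (no truncation)
genre_summary <- toy %>%
  group_by(Genre) %>%
  summarise(
    n_games = n(),
    median_sales = median(Global_Sales),
    mean_sales = mean(Global_Sales),
    .groups = "drop"
  ) %>%
  arrange(desc(median_sales))

print(genre_summary)
```

In the toy data, "Platform" ranks first by median while "Sports" has the larger mean because of a single outlier, which is why the choice between median and mean matters for this research question.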

p2 = ggplot(vgsales, aes(x = reorder(Genre, +Global_Sales, FUN = median), y = Global_Sales, fill = Genre)) + geom_point(shape = 21, size = 3)
p2 = p2 + theme_minimal()
p2 = p2 + theme(legend.position = "none")
p2 = p2 + labs(title = "Global Sales for Different Game Genres")
p2 = p2 + labs(x = "Game Genre", y = "Global Sales in Millions")
p2 = p2 + labs(caption = "Figure 2")
p2 = p2 + coord_flip()
p2

Figure 2 shows every game in the dataset as a point. The data is widely spread, and the game with the highest global sales is in the sports genre, followed by a game in the platform genre. The platform genre generally has more games at the higher end of global sales, which makes it one of the best-selling game genres globally.

Thus, the platform genre is the best-selling game genre globally. According to Statista (Clement, 2021), the best-selling video game genre in the United States is Action; in this data, however, Action ranks only around sixth or seventh globally. The two results differ in scope: Statista looked only at US sales, whereas this study looked at global sales.


3 References

Clement, J. (2021, May 5). Video game sales in the United States in 2018, by genre. https://www.statista.com/statistics/189592/breakdown-of-us-video-game-sales-2009-by-genre/
All Top Everything. (2021). Top 10 Biggest Video Game Companies in the World. https://www.alltopeverything.com/top-10-biggest-video-game-companies/
VGChartz. (2021). VGChartz Methodology. Data-Collection Methodology. https://www.vgchartz.com/methodology.php
Worldometer. (2017). Population Comparison: China, EU, USA, and Japan. https://www.worldometers.info/population/china-eu-usa-japan-comparison/